ReneWind

Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.

Among the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor data and analytical methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across the different machines involved in energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).

Objective

“ReneWind” is a company working on improving the machinery and processes involved in wind energy production using machine learning, and it has collected data on generator failures of wind turbines using sensors. It has shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 40,000 observations in the training set and 10,000 in the test set.

The objective is to build various classification models, tune them, and find the best one to help identify failures so that the generator can be repaired before it fails or breaks, reducing maintenance costs. The different costs associated with maintenance are as follows:

A value of “1” in the target variable should be considered a “failure”, and “0” represents “no failure”.

Data Description

Importing libraries

Loading Data

Observations

EDA

Univariate Analysis

Plotting histograms and boxplots for all the variables

Plotting all the features in one go with a for loop
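A minimal sketch of such a loop, using a small synthetic DataFrame as a stand-in for the ciphered sensor data (the column names and sizes here are illustrative; the real set has 40 predictors):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Synthetic stand-in for the ciphered sensor data
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=[f"V{i}" for i in range(1, 5)])

# One histogram and one boxplot per feature, plotted in a single loop
figures = []
for col in df.columns:
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
    ax_hist.hist(df[col], bins=30)
    ax_hist.set_title(f"Histogram of {col}")
    ax_box.boxplot(df[col], vert=False)
    ax_box.set_title(f"Boxplot of {col}")
    figures.append(fig)
    plt.close(fig)
```

Looping over `df.columns` this way produces one figure per feature, which scales cleanly to all 40 predictors.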

Univariate Observations

Bivariate Analysis

Observations

Data Pre-processing

Model Building

Model evaluation criterion

The nature of predictions made by the classification model will translate as follows:

Which metric to optimize?

Defining a function to output different metrics (including recall) on the train and validation sets, and a function to show the confusion matrix, to avoid repeating code each time.
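A minimal sketch of such helpers; the signatures (taking true and predicted labels directly) and the `model_performance`/`show_confusion_matrix` names are illustrative, not the notebook's exact definitions:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

def model_performance(y_true, y_pred):
    """Return accuracy, recall, precision and F1 as a one-row DataFrame."""
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })

def show_confusion_matrix(y_true, y_pred):
    """Return the confusion matrix (rows: actual, columns: predicted)."""
    return confusion_matrix(y_true, y_pred)
```

Calling these once per model on both the train and validation sets keeps the evaluation code consistent across all candidates.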

Defining scorer to be used for cross-validation and hyperparameter tuning
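A sketch of the scorer definition, assuming recall is the metric being optimized (synthetic data is used only to show the scorer plugged into cross-validation):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Recall is the metric to maximize here: a missed failure (false negative)
# means an expensive replacement, so CV folds are scored by recall.
scorer = make_scorer(recall_score)

# Quick check on synthetic imbalanced data
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y,
                         scoring=scorer, cv=5)
```

The same `scorer` object can then be passed to `GridSearchCV`/`RandomizedSearchCV` during hyperparameter tuning.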

Model Building with original data
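A minimal sketch of this step, cross-validating several candidate classifiers on recall; the model list, dataset, and sizes are illustrative stand-ins for the notebook's actual choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the ciphered training data
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
}

# Mean cross-validated recall for each candidate model
cv_recall = {name: cross_val_score(model, X, y, scoring="recall", cv=5).mean()
             for name, model in models.items()}
```

Sorting `cv_recall` gives a first ranking of the candidates before any resampling or tuning is applied.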

Observations on original data

After checking the performance, reviewing the confusion matrices for the training and validation data

Observations on the Confusion Matrix on the original data

Model Building with Oversampled data
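Notebooks of this kind often use SMOTE from imbalanced-learn for this step; the following is a dependency-light sketch of the same idea (random oversampling of the minority class) using `sklearn.utils.resample` on a synthetic stand-in:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic imbalanced stand-in: 20 failures among 200 observations
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V1", "V2", "V3"])
train["Target"] = np.array([1] * 20 + [0] * 180)

minority = train[train["Target"] == 1]
majority = train[train["Target"] == 0]

# Randomly duplicate minority rows until the classes are balanced
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=1)
train_over = pd.concat([majority, minority_up]).sample(frac=1, random_state=1)
```

Unlike SMOTE, which synthesizes new minority points by interpolation, this simply duplicates existing failure rows; both aim to give the classifier a balanced view of the two classes.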

Observations on oversampled data

After checking the performance, reviewing the confusion matrices for the training and validation data

Observations on the Confusion Matrix on the oversampled data

Model Building with Undersampled data
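The complementary resampling strategy: instead of duplicating failures, randomly drop majority-class rows until the classes balance. A sketch with `sklearn.utils.resample` on the same kind of synthetic stand-in (imbalanced-learn's `RandomUnderSampler` is the usual ready-made alternative):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic imbalanced stand-in: 20 failures among 200 observations
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V1", "V2", "V3"])
train["Target"] = np.array([1] * 20 + [0] * 180)

minority = train[train["Target"] == 1]
majority = train[train["Target"] == 0]

# Randomly drop majority rows (without replacement) until classes are balanced
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=1)
train_under = pd.concat([majority_down, minority]).sample(frac=1, random_state=1)
```

The trade-off versus oversampling: undersampling trains faster but discards most of the "no failure" observations, which can hurt performance when the minority class is very small.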

Observations on undersampled data

After checking the performance, reviewing the confusion matrices for the training and validation data

Observations on the Confusion Matrix on the undersampled data

Selecting the top three models for tuning

Hyperparameter Tuning

Sample Parameter Grids

Hyperparameter tuning can take a long time to run, so to keep the runtime manageable, you can use the following grids wherever required.

param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}

param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}

param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # each value tried separately
    "max_samples": np.arange(0.4, 0.7, 0.1),
}

param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

param_grid = {"C": np.arange(0.1, 1.1, 0.1)}

param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
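Any of these grids can be plugged into `GridSearchCV` (or `RandomizedSearchCV`, to cap the number of fits) with the recall scorer. A sketch using one of the smaller grids, a decision tree, and synthetic imbalanced data as a stand-in for the real training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the training data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)

param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimize for catching failures
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
```

After fitting, `search.best_params_` and `search.best_estimator_` give the tuned configuration and the refit model.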

Tuning method for GradientBoost with oversampled data

Tuning method for GradientBoost with undersampled data

Tuning method for AdaBoost with undersampled data

Model performance comparison and choosing the final model

Observations on Hyperparameter Tuned Models

Test set final performance

Observations

Pipelines to build the final model
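A minimal sketch of such a pipeline, chaining imputation with the final classifier; the gradient boosting model here is a placeholder for whichever model was selected above, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data, with a few injected gaps
# (sensor data often has missing readings)
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=1)
X[::17, 0] = np.nan

final_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps before modeling
    ("model", GradientBoostingClassifier(random_state=1)),
])
final_pipeline.fit(X, y)
preds = final_pipeline.predict(X)
```

Wrapping the preprocessing and the model in one Pipeline object ensures the same imputation is applied at prediction time and makes the final model easy to serialize and deploy.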

Business Insights and Conclusions